DOCUMENT RESUME

ED 222 556                                              TM 820 710

AUTHOR        Haertel, Edward
TITLE         Developing a Discrete Ability Profile Model for
              Mathematics Attainment. Final Report.
INSTITUTION   Education Commission of the States, Denver, Colo.
              National Assessment of Educational Progress.;
              Stanford Univ., Calif.
SPONS AGENCY  National Inst. of Education (ED), Washington, DC.
PUB DATE      [81]
GRANT         NIE-G-80-0003
NOTE          57p.; Tables are marginally legible due to small
              print. For related documents, see TM 820 707-712 and
              TM 820 716.

EDRS PRICE    MF01/PC03 Plus Postage.
DESCRIPTORS   Cluster Analysis; Educational Assessment; Elementary
              Secondary Education; *Item Analysis; Mathematical
              Models; *Mathematics Achievement; National Surveys;
              *Skill Analysis; Test Construction
IDENTIFIERS   *National Assessment of Educational Progress; *NIE
              ECS NAEP Item Development Project; Second Mathematics
              Assessment (1978)

ABSTRACT
     National Assessment of Educational Progress (NAEP)
Mathematics Assessment (1978) data were analyzed using latent class
models to determine patterns of distinct skills required by different
exercises and to estimate the pattern distributions. The populations
were 9-, 13-, and 17-year-old examinees. Skills were assumed to be
intermediate between objectives and report topics, such as "solving
quadratic equations," and were treated as dichotomous: an examinee
either did or did not possess the skill. At ages 9 and 13, one
assessment booklet was selected and nine clusters of three exercise
parts were chosen. Twenty pairings of clusters yielded a 6-item set
for analysis. At age 17, six exercise parts of apparently common
skills were drawn from each of six booklets. Item clusters which
could be collapsed and organized hierarchically were indicated by the
latent classes. For each cluster, all analyses including it were
examined together, yielding separate estimates of the proportion of
examinees able to solve items in that cluster. The distributions of
these estimates were an indication of the cluster's conformity to the
assumption of skill dichotomies. Student mathematics skills at each
age level are reported. Results of the use of NAEP data tapes are
reported and improvements in the methodology are suggested. Primary
type of information provided by the report: Results (Secondary
Analysis). (CM)



*********************************************************************** 

* Reproductions supplied by EDRS are the best that can be made * 

* from the original document. * 
*********************************************************************** 



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it. Minor changes have been made to improve
reproduction quality. Points of view or opinions stated in this document
do not necessarily represent official NIE position or policy.


Developing a Discrete Ability Profile Model
for Mathematics Attainment

FINAL REPORT

Edward Haertel
Stanford University


The work upon which this publication is based was performed pursuant to
Grant NIE-G-80-0003 of the National Institute of Education. It does not,
however, necessarily reflect the views of that agency.



Abstract

Data from the year 9 mathematics assessment were analyzed using latent
class models to determine patterns of distinct skills required
by different exercises, and to estimate the distributions of these skill
patterns in age 9, 13, and 17 populations. At each of ages 9 and 13, one
booklet was selected, and nine clusters of three exercise parts each were
chosen from that booklet. Over twenty pairings of these three-item clusters
yielded the six-item sets analyzed. At age 17, six exercise parts that
appeared to require the same common skills were drawn from each of six
different booklets for analysis.

At age 9, roughly 89 percent of the children could do simple addition
problems correctly. Subtraction, counting, working with place values,
common measuring units, and simple geometric concepts were available to
roughly 77 percent of nine-year-olds. Only [??] percent could do more
difficult computations, and only 53 percent could solve some problems
requiring use of a ruler. Skills at age 13 were less clearly
distinguished. Number line problems and simple computations could be
solved by [7?] percent of thirteen-year-olds. The radical sign was
interpretable to 66 percent, but only 30 percent could give decimal
equivalents to common fractions. Skills for various algebra topics were
available to [?0] to 60 percent of children at this age. From 85 to 92
percent could do unit conversions and knew some geometry facts and
concepts. The age 17 analyses showed 52 percent unable to solve any but
the most elementary exercises requiring an understanding that letters may
represent variable numerical quantities. Another eight percent could
solve some such exercises by reasoning, but could not solve those
requiring formal algebra training.

No problems were encountered with the NAEP tapes. Improvements in the
methodology, as well as areas where further work is required, are reported.



TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables and Figures
Introduction
Conceptualizing Mathematics Attainment
Items as Skill Indicators
Carrying Out the Analyses
Strategy for Analysis
Using the NAEP Data Tapes
Results and Interpretation: Age 9
Results and Interpretation: Age 13
Results and Interpretation: Age 17
Conclusions

LIST OF TABLES AND FIGURES

Table 1.   Age 9 Estimates of Misclassification Probabilities by Item
Table 2.   [title illegible]
Table 3.   Age 13 Estimates of Misclassification Probabilities by Item
Table 4.   [title illegible]
[Table number illegible]   Exercises Examined in Age 17 Analyses,
           p-values, and Latent [remainder illegible]
Figure 1.  Age 9 Booklet 3 Stem-and-Leaf of Crude Item Difficulties
Figure 2.  Age 9 Analyses and Proportions in Each Skill Pattern
Figure 3.  Age 13 Booklet 6 Stem-and-Leaf of Crude Item Difficulties
Figure 4.  Age 13 Analyses and Proportions in Each Skill Pattern



DRAFT FINAL REPORT: Developing a Discrete Ability Profile Model
for Mathematics Attainment

The National Assessment of Educational Progress affords a rich resource
for the description of American young people's academic skills. Careful
deliberation as to appropriate objectives, matching of item formats to
objectives, including the use [illegible] as necessary, and
use of all (technically sound) exercises regardless of their "item
statistics" yield item pools that tap an unparalleled array of specific
competencies. In contrast, traditional tests sample a narrower range of
objectives, often constrain all items to a common format, and typically
include only items that are highly correlated with total score or some
other single criterion. Matrix sampling permits the administration of
literally hundreds of exercises, all to nationally representative respondent
samples, without unduly burdening any individual pupils or schools.

Unfortunately, researchers seeking to exploit these data to find out
what children have learned or can do may be overwhelmed by the embarrassment
of riches. Accustomed to conceiving items as multiple indicators of a
single ability, they are ill-equipped to draw useful conclusions from
hundreds of diverse items, each of interest in its own right. The most
typical response has been to attempt to build scales, using items that
appear to be related, and restricting attention to one exercise booklet
(package) at a time.

Those turning to NAEP publications of findings will find little
additional guidance. Here, results are aggregated over packages, but
exercises are typically pooled into broad "report topics" categories
spanning many objectives, simply because objectives are so numerous. The
only direct longitudinal comparisons made are of performance on identical
sets of items at different points in time, and for descriptions of absolute
level of performance at a single time point, the reader is told, "30% of
nine-year-olds [illegible] this problem," etc. It is then
left to the reader to imagine what generalizations from these statements
are or are not appropriate. Only 1% of seventeen-year-olds could correctly
solve the equation "(x - 2)^2 = 9" for x, but in another sample, 18% could
"Find the solution set of x^2 - 5x + 6 = 0."* Is the difference due to the
wording of the problems? The particular numbers? The format of the
equations? How are we to generalize about the proportion of
seventeen-year-olds who can solve quadratic equations? How would the
p-values change if these items were multiple choice, say, rather than free
response? There is no way to tell.

Here is a dilemma. The very breadth and richness of the data base make
it resistant to all but the most superficial summaries. The conventional
model of items as multiple indicators of a single ability continuum is not
appropriate, and there appears to be no alternative but to compute average
p-values for masses of exercises. There is clearly a need for better
methods of reporting NAEP findings. The purpose of this study was to
investigate one possible method.

*Exercises S0[?]29A and S0905A from the year 09 mathematics assessment.



7 



3 . 

Con6eptuall2ing Mathematics Attainment 

An empirically grounded summary of NAEP findings must be based on some
model for exercise response data. Traditional models, both classical and
latent trait, are most often appropriate when (1) there are large numbers
of items that may be assumed unidimensional, and (2) the intent is to make
inferences about individual examinees' levels of the ability underlying
responses to all the items. In contrast, the NAEP data offer (1) only a
few exercises tapping each of many abilities and (2) a problem of describing
performance aggregated across individuals. Clearly, then, a different kind
of model is needed.

For this study, the kind of model chosen was a restricted latent class
model similar to those first investigated by Lazarsfeld and Henry (1968)
and applied to item response data by Proctor (1970), Dayton and Macready
(1976), Goodman (1975), Haertel (1980), and others. The basic "unit" of
ability assumed in these models may be termed a "skill." For purposes of
this study, skills were assumed to be intermediate in scope between
objectives and report topics. For example, "solving quadratic equations"
or "making conversions among common units of measurement" might be skills.
A distinctive feature of the models is that skills are treated as
dichotomous: a given examinee either does or does not possess the skill.
If an item requires only skills an examinee possesses, then s/he can solve
that item. If it requires even one additional skill, s/he cannot. When
applied to individual-level data where the intent is to distinguish a
continuous range of ability levels, this might not be a useful assumption.
For describing populations, however, it works well. With respect to
any given skill dichotomy, examinees can be partitioned (conceptually) into
just two groups: those that possess the skill and those that do not. The
proportion in the first category is sufficient to summarize completely the
distribution of that skill in the population.

When multiple skills are considered, it may be that hierarchies will
be found. For example, it is unlikely that any students possess the skill
of solving quadratic equations but lack the skill of solving
linear equations in one unknown. Thus, these two skills together
define only three groups of examinees: those possessing neither, only the
linear, or both. On the other hand, students who can do conversions among
common units of measurement may or may not be able to solve linear
equations, and conversely. These two skills define four groups of
examinees: those possessing neither, only the first, only the second, or
both. The set of all skills defines a very large number of possible skill
profiles, and under these models, a list of the proportions of examinees
in each possible profile would completely describe the distribution of
mathematics abilities in the population. While such a complete
cross-tabulation of skills possessed or not possessed is not obtainable in
practice, various marginal tabulations involving subsets of the skills can
be estimated, and provide useful generalizations across items.
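
The counting argument above can be illustrated with a short sketch. The function name `admissible_profiles` and the prerequisite mapping are illustrative assumptions, not part of the original analysis programs; the sketch simply enumerates the 0/1 skill profiles that remain possible once a hierarchy is imposed.

```python
from itertools import product

def admissible_profiles(skills, prerequisites):
    """Enumerate skill profiles (tuples of 0/1) consistent with a hierarchy.

    `prerequisites` maps a skill name to the skills that must also be
    possessed whenever that skill is possessed (hypothetical structure).
    """
    profiles = []
    for bits in product((0, 1), repeat=len(skills)):
        held = dict(zip(skills, bits))
        consistent = all(
            all(held[p] for p in prerequisites.get(s, ()))
            for s in skills
            if held[s]
        )
        if consistent:
            profiles.append(bits)
    return profiles

# Quadratic equations presuppose linear equations: three groups remain.
hierarchical = admissible_profiles(
    ["linear", "quadratic"], {"quadratic": ["linear"]}
)

# Unit conversion and linear equations are unordered: all four groups remain.
independent = admissible_profiles(["units", "linear"], {})

print(len(hierarchical), len(independent))
```

Under these assumptions the hierarchical pair yields the three groups named in the text (neither, linear only, both), and the unordered pair yields four.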

Before explicating details of the models, estimation procedures, tests
of fit, etc., it will be useful to consider more carefully the
reasonableness of the dichotomy assumption, and sources of skill
distinctions in the population. When a single examinee encounters a single
exercise, it seems reasonable to suppose that s/he either can or cannot get
it right. If the exercise could be given to the same child repeatedly with
perfect forgetting between administrations, we would expect either nearly
all incorrect responses or nearly all correct responses, with the
imperfections due only to guessing or to carelessness. Thus, it does seem
reasonable that a single item defines a dichotomy, and not a continuum.
After all, responses are scored dichotomously, so an item in isolation can
give us no more information than "probably present" or "probably absent"
concerning whatever skills it requires. Next, if two items are considered,
each will define such a binary partition. Examinees will be able or unable
to solve the first, and able or unable to solve the second. If the items
are quite similar in content, it may be a good approximation to suppose
that they in fact define the same latent dichotomy, and that examinees
who get one right and the other wrong have guessed a right answer or else
have been careless in giving a wrong one. In fact, if we are free to
estimate the probabilities of "false positives" and "false negatives" for
the two items as well as the proportion who really can solve the two items,
such a model can always reproduce perfectly the actual frequencies of each
possible response pattern to two, or even three, items. With from four to
six items that are homogeneous in content, this model, with its single
latent dichotomy, is virtually indistinguishable from continuous
unidimensional latent trait models in its ability to reproduce actual
frequencies of different patterns of item responses.* Obviously, with a
large pool of items testing the same complex ability, we can distinguish
more than two levels of ability in examinees. The point is that with fewer
and/or more homogeneous items the alternative assumption, that examinees
either have or have not acquired the skill, may be as good or better for
some purposes.

*Findings are from unpublished investigations by the author. Comparisons
were made between the two-state latent class model and the two-parameter
normal ogive model (Bock & Lieberman, 1970), which has an almost identical
number of parameters.



The skills assumed to be measured by NAEP exercises are not basic
psychological processes, but arise largely from the organization of the
mathematics curriculum in American schools. Educators have chosen to parcel
out the content of mathematics into courses such as Algebra I, Geometry,
etc. Topics usually taught together are likely to cohere in single skills,
but if the organization of the curriculum were changed, the skills would
change, too. Of course, if some topics require that students attain a given
developmental level before they can be acquired, then skill patterns
involving those topics will reflect cognitive processing capabilities as
well as curricular conventions. No attempt was made in this study to
disentangle sources of skill distinctions. The objectives were to determine
what skills need be assumed, and how prevalent they are in the population.

Items as Skill Indicators

In the latent class models used in this study, each exercise is assumed
to require one or more distinct skills, and examinees are assumed to either
possess or not possess each skill. The skills an exercise requires and an
examinee possesses are the sole determinants of the probability that the
examinee will answer correctly. There will be one (low) probability of a
correct response by an examinee lacking one or more requisite skills, and
a second (high) probability of a correct response by an examinee possessing
all the skills the exercise requires. Responses to different exercise parts
are assumed conditionally independent given skill possession. Thus, for
any single examinee or for any group of examinees who possess the same
skills, the probability of a pattern of responses across exercises is just
the product of the separate probabilities of each component response. It
is this assumption of conditional independence that makes it possible to
estimate parameters of the latent class model. An exactly analogous
assumption is required in all commonly used latent trait models for item
response data, as well.
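
The product rule implied by conditional independence can be sketched as follows. The function name, argument layout, and the numerical rates are illustrative assumptions rather than anything taken from the report's programs.

```python
def pattern_probability(pattern, can_solve, fp, tp):
    """Probability of a 0/1 response pattern for examinees in one latent class.

    pattern   -- observed responses, 1 = correct, 0 = incorrect
    can_solve -- per item, True if the class holds every skill the item requires
    fp, tp    -- per-item false positive and true positive rates
    """
    prob = 1.0
    for x, able, f, t in zip(pattern, can_solve, fp, tp):
        p_correct = t if able else f          # only two rates per item
        prob *= p_correct if x == 1 else 1.0 - p_correct
    return prob

# Hypothetical two-item example: the class can solve item 1 but not item 2.
example = pattern_probability(
    pattern=(1, 0), can_solve=(True, False), fp=(0.2, 0.1), tp=(0.9, 0.8)
)
print(example)
```

Here the class answers item 1 correctly with its true positive rate and misses item 2 with one minus its false positive rate, and the two factors are simply multiplied.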

It follows from these assumptions that a given exercise can be
completely characterized by telling (1) what skills it requires, (2) its
false positive rate, i.e., the (hopefully low) probability that any
examinee lacking one or more requisite skills will answer it correctly, and
(3) its true positive rate, i.e., the (hopefully high) probability of a
correct response by any examinee possessing all the requisite skills. The
hypothesized skill requirements of each of a set of exercise parts determine
what latent classes are assumed when that set is analyzed. Following the
analysis, a fit statistic and residuals are examined, and the hypothesized
skill requirements are revised as necessary to obtain a satisfactory fit
to observed response pattern proportions.* The analysis consists of
estimating simultaneously the proportions of examinees in each assumed
latent class, and the false positive and true positive response
probabilities (FP and TP rates) for each exercise part. The p-value for
each exercise part is a weighted average of its FP and TP rates, the weights
being the total proportions of examinees who should be unable to solve and
able to solve that part, respectively.

*Observed response pattern proportions are the fractions of examinees who
give each possible pattern of correct and incorrect responses to the set
of exercise parts being analyzed.
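
The weighted-average relationship between an exercise part's p-value and its FP and TP rates can be written out directly. The function name and the numbers below are illustrative assumptions chosen only to show the arithmetic.

```python
def implied_p_value(prop_able, fp, tp):
    """Model-implied proportion correct for one exercise part.

    prop_able -- total proportion of examinees in latent classes able to solve it
    fp, tp    -- the part's false positive and true positive rates
    """
    return (1.0 - prop_able) * fp + prop_able * tp

# Hypothetical values: 63% able, FP rate .22, TP rate .90.
p = implied_p_value(prop_able=0.63, fp=0.22, tp=0.90)
print(round(p, 4))
```

The same function shows why two parts requiring the same skill can still have different p-values: the weights are shared, but the FP and TP rates are specific to each part.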

The estimated FP and TP rates are assumed to be parameters of the item
itself, and in fact are found to be quite stable across estimations where
the item is included in different sets. FP rates indicate susceptibility
to guessing. They are typically higher for multiple choice than for free
response items, but may also be high for items that invite some alternative
solution that circumvents the intended skill requirement. For example, one
free-response item presents the stimulus

          □ - 19 = 32

and asks, "What number should go in the □ to make this number sentence
TRUE?"* When this item is coded as requiring the same skill as two other
items that require a basic understanding of equations, a satisfactory fit
is obtained (about 63% of thirteen-year-olds possess the skill), but the
false positive rate is estimated at .22. This reflects not guessing in the
usual sense, but the possibility of obtaining the correct answer without
understanding, by one of the few possible elementary operations that can
be performed using the numbers shown: the answer is 19 + 32.

The TP rate should reflect primarily carelessness in responding, but
also can indicate idiosyncratic informational requirements or unique
difficulties of single exercise parts. For example, a skill of converting
units was assumed to underlie three (unreleased) multiple choice items
requiring conversions between quarts and gallons, ounces and pounds, or feet
and yards, respectively.** While models assuming a single skill did fit
these items (roughly 53% of thirteen-year-olds possess the skill), their
estimated true positive rates were on the order of .84, .90, and .87,
respectively. Thus, 16% (1 - .84) of those who possessed the skill lacked
specific information on quarts and gallons, 10% lacked specific information
on ounces and pounds, etc.

*Exercise T0608A from the year 09 mathematics assessment.

**T0615B, T0616C, and T0616D from the year 09 mathematics assessment.

The difference between an item's TP and FP rates may be interpreted
as an index of item discrimination.* A large difference indicates that the
item reliably distinguishes between those who do and do not possess the
skills required for its solution, while a small difference indicates that
correct responses by nonmasters and masters are more nearly equally likely.
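
Both discrimination indices can be computed from an item's two rates. This is a minimal sketch with illustrative function names and hypothetical rates; the log odds ratio is the standard form, the log of TP(1 - FP) over FP(1 - TP), and it requires both rates to lie strictly between 0 and 1.

```python
import math

def rate_difference(tp, fp):
    """Simple discrimination index: TP rate minus FP rate."""
    return tp - fp

def log_odds_ratio(tp, fp):
    """Log odds ratio of a correct response, masters versus nonmasters.

    Assumes 0 < fp < 1 and 0 < tp < 1 so that the odds are finite.
    """
    return math.log((tp * (1.0 - fp)) / (fp * (1.0 - tp)))

# Hypothetical items: one sharply discriminating, one weakly discriminating.
sharp = (rate_difference(0.90, 0.10), log_odds_ratio(0.90, 0.10))
weak = (rate_difference(0.60, 0.40), log_odds_ratio(0.60, 0.40))
print(sharp, weak)
```

An item whose TP and FP rates are equal has a difference of zero and a log odds ratio of zero, meaning a correct response carries no information about skill possession.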

Carrying Out the Analyses 

This section describes the computational steps required in applying
latent class models with the NAEP data. The overall approach taken in
studying mathematics skills, of course, includes more than these technical
procedures. In the next section, "Strategy for Analysis," substantive
decisions are described, including questions as to which booklets to
examine, which items to group together, and how to integrate findings from
many separate analyses to reach meaningful conclusions. Before turning to
these broader issues, details of the methodology are described below.



*An optimal discrimination index is the log odds ratio,
$\ln \dfrac{TP\,(1 - FP)}{FP\,(1 - TP)}$,
which can be used in exactly the same way as the discrimination parameter
of the two-parameter logistic model in constructing an optimum scoring rule.

The latent class analysis begins with a summary of the data that tells
what proportion of examinees gave each possible pattern of correct and
incorrect responses to a set of items. There are both advantages and
disadvantages to this approach. On the one hand, the costs of computation
are little affected by sample size, since response pattern proportions may
be obtained in a single pass over the data file, and need not be
re-calculated each time a new model is fit to the same items. The number
of possible patterns, hence the number of proportions, is independent of
the N. In addition, issues of weighting and other adjustments for the NAEP
sampling design may be handled prior to the actual analysis. On the other
hand, summarizing the data according to response pattern proportions
severely limits the number of items or exercises that can be examined at
one time. The number of possible response patterns is four for two items
(00, 01, 10, 11), eight for three items (000, 001, 010, 011, 100, 101, 110,
111), sixteen for four items, and in general, 2^k for k items. In examining
the NAEP data, sets of six exercise parts were employed. Thus, data for
all respondents could be summarized in just 64 numbers, corresponding to
all possible patterns of correct and incorrect responses, from 000000 to
111111. Since the intent of this study was to distinguish the specific
skills required by individual exercise parts, sets of six at a time were
quite sufficient. However, in order to attain a clear conception of the
skills assessed by the entire exercise pool, it was necessary to integrate
findings across many separate sets of items.

After obtaining response pattern proportions, the next step was to fit
a latent class model to these proportions, in order to determine which of
the exercise parts could be assumed to require the same skills and which
must be assumed to require different skills. In each single computer run,
just one model is tested. Based on the fit of this model, it is either
accepted as accounting adequately for the data obtained, or rejected.
In the latter case, information on just where and how the model failed to
fit (analysis of residuals) is used to guide the selection of a new model
to try, and the cycle is repeated until an acceptable solution is found.
It is for this final model that item misclassification parameter estimates
and latent class proportions (estimated proportions of respondents
possessing each combination of skills) are most carefully examined.

Specific steps of the procedure are as follows. After choosing a set
of six exercise parts to be analyzed, a table is prepared showing the
location of each part in the file and its correct response value. The
original Fortran program used to access the tape requires a one-column
numeric field for each item, so for free-response items the first column
of the (typical) two-column field was examined for a "1". Control cards
for the program, called PULL, specify logical record length, column numbers,
and keys for each item, and the first and last column of the weight
variable, if used. Additional information as to which tape file to access
and the name to be given to the output data set is coded in the JCL for
the PULL run. For all analyses, responses were weighted using the variable
WEIGHTS, "student weight." The PULL program read each record, scored the
items to determine the response pattern, and incremented a tally for that
pattern by the weight for that record. Thus, the total weight for each
pattern was obtained. These were then divided by the grand total weight
to obtain weighted response pattern proportions, which were written to a
small data file along with the original grand total weight. The PULL
program could actually process up to six item sets at a time, producing six
of these small data files from a single pass over the data. The cost of
a typical run to produce six data sets for a single exercise package at
normal daytime rates was $3.32, including a $2.00 charge for mounting the
tapes.*
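
The tallying step described above can be sketched in a few lines. This is not the PULL program, whose Fortran source is not reproduced in the report; the record layout (a response string plus a weight) and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_pattern_proportions(records, keys):
    """Single pass over the data: score items, tally weights by pattern.

    records -- iterable of (responses, weight), one response code per item
    keys    -- the correct response code for each item
    Returns (proportions by pattern, grand total weight).
    """
    totals = defaultdict(float)
    grand_total = 0.0
    for responses, weight in records:
        # Score each item 1 (matches key) or 0, giving a pattern such as "101".
        pattern = "".join(
            "1" if r == k else "0" for r, k in zip(responses, keys)
        )
        totals[pattern] += weight
        grand_total += weight
    proportions = {p: w / grand_total for p, w in totals.items()}
    return proportions, grand_total

# Hypothetical two-item data set with unequal student weights.
sample_records = [("AB", 2.0), ("AC", 1.0), ("BB", 1.0)]
proportions, total_weight = weighted_pattern_proportions(sample_records, "AB")
print(proportions, total_weight)
```

Because only the tallies are kept, the size of the output depends on the number of items (at most 2^k patterns) and not on the number of records read.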






The small files were next edited and copied into a "library" file for
storage. Since each could be contained on only 12 80-column cards, it was
most efficient to keep them online. The editing required was to replace
the total weight with the actual N, adjusted according to design effect for
the sample. What was done was to set the N equal to one half the number
of records in the file, i.e., to assume a design effect of 2.00. While this
adjustment cannot be justified rigorously, various indirect tests have
indicated that it is satisfactory in practice (Haertel, 1980).
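
The design-effect adjustment amounts to dividing the number of records by the assumed design effect before proportions are converted to counts. The function names and the record count below are illustrative assumptions.

```python
def effective_sample_size(n_records, design_effect=2.0):
    """Number of records deflated by the assumed design effect."""
    return n_records / design_effect

def effective_counts(proportions, n_records, design_effect=2.0):
    """Convert response pattern proportions to design-adjusted counts."""
    n_eff = effective_sample_size(n_records, design_effect)
    return {pattern: prop * n_eff for pattern, prop in proportions.items()}

# Hypothetical file of 2,400 records with a design effect of 2.00.
n_eff = effective_sample_size(2400)
counts = effective_counts({"11": 0.5, "00": 0.5}, 2400)
print(n_eff, counts)
```

Halving the N in this way halves any chi square statistic computed from the same proportions, which makes tests of fit correspondingly more conservative.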






The next logical step in the analysis was to input the response pattern
proportions to a second Fortran program,** MLMN, which produces maximum
likelihood estimates of latent class proportions and misclassification
probabilities, together with associated statistics. However, because the
control cards required for MLMN are quite complex, another Fortran program,
PREP, is executed first, to prepare the stream of control cards for MLMN.
The PREP program needs very brief control cards telling which latent classes
are to be included and giving labels for the model, the six-item set, and
each separate exercise part. Output from PREP is a jobstream which can be
input to MLMN.

*In one case, a slightly more elaborate procedure was used, in order to
combine four exercise parts into a single overall exercise score. An
additional Fortran program was used to read each record from the tape, score
that one exercise, and write the record to a temporary disk file. The disk
file was then input to the PULL program. This was done for exercise parts
T0632A, T0632B, T0632C, and T0632D from the year 09 mathematics assessment.

**This program was written by Richard Wolfe, currently at the Ontario
Institute for Studies in Education. It has been modified by the Principal
Investigator for use in this study, but is not generally available. However,
newer programs incorporating more recently developed estimation algorithms
are generally available and can be used to carry out analyses like those
reported. The principal advantages of the MLMN program are superior
numerical accuracy, calculation of asymptotic standard errors and
covariances of the estimates, and detailed analyses of residuals useful in
deciding how to revise provisional models to improve the fit.

The first run for each set was to fit the simplest possible latent
class model, with only two latent states. These are labeled the "null"
state and the "all" state, and are included in any subsequent models, as
well. It is assumed that examinees conforming to the "null" state cannot
solve any of the six exercise parts, so any correct response by these
examinees must be a false positive. Examinees conforming to the "all" state
are assumed capable of solving all items correctly, so that any correct
responses they provide are true positives, and any incorrect responses are
false negatives.
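
The response pattern proportions implied by the two-state model can be sketched as a mixture of two products. This shows only how fitted values follow from a given set of parameters; it is not the MLMN estimation procedure, and the function name and parameter values are illustrative assumptions.

```python
from itertools import product

def two_state_fitted(prop_all, fp, tp):
    """Fitted proportion for every response pattern under the null/all model.

    prop_all -- proportion of examinees in the "all" state
    fp, tp   -- per-item false positive and true positive rates
    """
    fitted = {}
    for pattern in product((0, 1), repeat=len(fp)):
        p_null = 1.0   # probability of the pattern in the "null" state
        p_all = 1.0    # probability of the pattern in the "all" state
        for x, f, t in zip(pattern, fp, tp):
            p_null *= f if x else 1.0 - f
            p_all *= t if x else 1.0 - t
        key = "".join(str(x) for x in pattern)
        fitted[key] = (1.0 - prop_all) * p_null + prop_all * p_all
    return fitted

# Hypothetical six-item set with identical rates for every item.
fitted = two_state_fitted(prop_all=0.6, fp=[0.15] * 6, tp=[0.85] * 6)
print(len(fitted), round(sum(fitted.values()), 6))
```

With six items the function returns 64 fitted proportions that sum to one, ready to be compared with the observed proportions.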

MLMN estimates the proportions of examinees in each latent class, and
probabilities of correct responses to each item by members of each class.
Fit is assessed by both likelihood ratio and Pearson chi square statistics.
If the fit of this initial model is satisfactory (as indicated by a
non-significant chi square), it indicates that the assumption of a single
underlying skill dichotomy common to all six items cannot be disconfirmed.
In this case, analysis of this exercise set is completed. Because sets of
exercises are chosen to detect distinct skills, however, it more often
happens that the chi squares for this initial model are unacceptably large.
In this case the analysis of residuals is consulted to determine what
additional latent classes may be required to improve the fit. It should
be explained that the parameter estimates are used to predict the
proportions of examinees expected for each response pattern. These fitted
(predicted) values are subtracted from the observed values to obtain
residuals. If the model fits the data, residuals are expected to be small
and non-systematic. If the model fitted does not adequately account for
the observed response pattern proportions, systematic patterns in the
residuals usually offer clues to more complex models that would do a better
job of accounting for the data. The problem in analyzing residuals is to
detect these patterns.
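
The two fit statistics and the residuals can be computed from observed and fitted proportions as sketched below. The formulas are the standard likelihood ratio and Pearson chi squares expressed in terms of proportions and a sample size; the function name and data are illustrative assumptions, and every fitted proportion is assumed to be positive.

```python
import math

def fit_statistics(observed, fitted, n):
    """Likelihood ratio and Pearson chi squares plus residuals.

    observed, fitted -- dicts mapping response pattern to proportion
    n                -- (design-adjusted) sample size
    Assumes every fitted proportion is greater than zero.
    """
    g2 = 0.0
    x2 = 0.0
    residuals = {}
    for pattern, e in fitted.items():
        o = observed.get(pattern, 0.0)
        residuals[pattern] = o - e           # observed minus fitted
        if o > 0.0:                          # empty cells add nothing to G2
            g2 += 2.0 * n * o * math.log(o / e)
        x2 += n * (o - e) ** 2 / e
    return g2, x2, residuals

# Hypothetical one-item illustration with N = 100.
g2, x2, residuals = fit_statistics(
    observed={"0": 0.5, "1": 0.5}, fitted={"0": 0.4, "1": 0.6}, n=100
)
print(round(g2, 3), round(x2, 3), residuals)
```

Both statistics are zero when observed and fitted proportions agree exactly, and both grow in proportion to n for fixed discrepancies.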

Several methods for analysis of residuals have been tried by the
author, and the problem is by no means solved. The best method thus far
is as follows. Consider the residuals for the 64 response patterns to be
entries in a 2 x 2 x 2 x 2 x 2 x 2 table, with each dimension corresponding
to an item, and the two levels of the dimension corresponding to incorrect
("0") or correct ("1") responses to that item. Then, obtain all possible
marginals of the table by summing across every possible combination of one
or more dimensions. For each table formed by collapsing across just one
dimension there will be 32 marginals. When two dimensions are collapsed
there will be 16 marginals, etc. Of these, list, for each possible
collapsing of the original table, the one marginal for "correct" responses
to all remaining items. When one of these values is large, it suggests that
the items not collapsed in obtaining that value covary to a greater extent
than the current model permits. Thus, a new latent class corresponding to
examinees who can solve those items but not the remaining items is called
for. In other words, the analysis of residuals permits identification of
subsets of exercise parts that require a common skill not required by the
remaining exercise parts. When the corresponding latent class is
introduced, the fit is generally improved.
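The collapsing procedure just described can be sketched in a few lines. This is a hedged illustration rather than the author's program: the function name is hypothetical, and the residual table below is a toy example constructed so that items 1 to 3 covary more than the model allows.

```python
from itertools import combinations, product

def all_correct_marginals(resid, n_items=6):
    """For every way of collapsing (summing over) one or more items while
    keeping at least one, return the marginal residual for 'correct'
    responses to all items that were kept."""
    marginals = {}
    for k in range(1, n_items):
        for collapsed in combinations(range(n_items), k):
            kept = tuple(i for i in range(n_items) if i not in collapsed)
            marginals[kept] = sum(
                r for pattern, r in resid.items()
                if all(pattern[i] == 1 for i in kept)
            )
    return marginals

# Toy residuals: too many examinees solve items 0, 1, 2 but not 3, 4, 5.
resid = {pattern: 0.0 for pattern in product((0, 1), repeat=6)}
resid[(1, 1, 1, 0, 0, 0)] = 0.03
resid[(0, 0, 0, 0, 0, 0)] = -0.03

marginals = all_correct_marginals(resid)
largest = max(marginals, key=marginals.get)
print(largest, marginals[largest])
```

In this toy case the largest marginal belongs to the kept items (0, 1, 2), which would suggest adding a latent class for examinees who can solve those three items but not the others. Smaller subsets of the same items show the same value, so in practice the largest subset with a large marginal is the one of interest.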

After the model is revised and a new run is made, a difference chi
square is calculated. This is simply the difference between the original
and new likelihood ratio chi squares, and is asymptotically distributed as
chi square on one degree of freedom, if only one new parameter was
introduced and if the less inclusive model fits the data. A significant
difference chi square indicates that the additional latent class should be
retained; if the overall fit is still not satisfactory, additional latent
classes must be introduced, based on analysis of residuals from the new
model. If the difference chi square is not significant, the new latent
class is not retained in the model, and a different latent class is
introduced instead. This procedure is repeated until a satisfactory fit is
obtained, no additional latent classes can be found that yield further
improvements in fit, or the estimated proportions of examinees in
successive states fall below about 2%. These latter situations are rare;
in almost all cases, the chi square becomes acceptably small with no more
than four latent classes, including the "null" and "all" classes.
Acceptable fit is defined by a chi square below 70 (usually below 65).
Because the design effect adjustment directly affects the value of the
chi square,* rigid adherence to a precise significance level in
interpreting chi squares is not appropriate. Chi squares of 65 to 70 on
48 to 50 degrees of freedom are significant at roughly the .05 level.
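The fit statistics and the one-degree-of-freedom difference test can be sketched as below. The formulas are the standard likelihood ratio and Pearson chi squares; the function names and the chi square values in the example are hypothetical, not results from this study. The p-value uses the identity that a chi square on one degree of freedom is a squared standard normal deviate.

```python
import math

def likelihood_ratio_chi2(observed, expected):
    """G^2 = 2 * sum(O * ln(O / E)) over cells with a nonzero observed count."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def pearson_chi2(observed, expected):
    """X^2 = sum((O - E)^2 / E) over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_1df(x):
    """Upper-tail probability of a chi square on one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def difference_test(g2_old, g2_new, alpha=0.05):
    """Difference chi square for one added parameter (one added latent class)."""
    diff = max(g2_old - g2_new, 0.0)
    p = p_value_1df(diff)
    return diff, p, p < alpha

# Hypothetical likelihood ratio chi squares before and after adding a class.
diff, p, retain = difference_test(82.4, 66.1)
print(round(diff, 1), retain)
```

Under these assumed values the added class would be retained, and the new chi square of 66.1 would fall in the range the text treats as acceptable.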



*Adjusting for a design effect of 2 will halve the value of both the
likelihood ratio and the Pearson chi squares, will increase all standard
errors by a factor of √2, but will not affect the magnitude of any
parameter estimates, fitted response pattern proportions, or residuals.
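The adjustment described in this note amounts to two simple rescalings, sketched here with a hypothetical function name and made-up input values.

```python
import math

def adjust_for_design_effect(chi2, std_errors, deff=2.0):
    """Divide a chi square by the design effect and inflate standard errors
    by its square root; point estimates are left untouched."""
    adjusted_chi2 = chi2 / deff
    adjusted_se = [se * math.sqrt(deff) for se in std_errors]
    return adjusted_chi2, adjusted_se

# Hypothetical unadjusted values.
chi2_adj, se_adj = adjust_for_design_effect(130.0, [0.02, 0.015])
print(chi2_adj, [round(se, 4) for se in se_adj])
```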






Strategy for Analysis

This section describes the use of latent class models in examining skill
distinctions in mathematics attainment. The first step for each age level
was to choose a representative exercise package (booklet). The appendices
were dumped from tape, and Appendix 4 was used to find the booklet with the
best representation across the areas of algebra, geometry, measurement, and
computation.* Microfiche of the actual booklets was also consulted in
making this determination. The codebook file for the selected booklet was
then dumped, and hard copy was prepared from the microfiche for the booklet
selected.

On the basis of Appendix 4 classifications as well as direct
inspections of the exercises, all cognitive exercise parts were classified
as Computation, Algebra, Geometry, Measurement, or Other. ("Numbers and
Numeration" exercises were classified under computation.) Within each
category, a stem-and-leaf of "crude item difficulty" was prepared. These
"crude item difficulties" were the (unweighted) proportions of correct
responses reported in the codebook.
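A stem-and-leaf display of crude item difficulties can be sketched as follows. The function names and the difficulty values are hypothetical; the display assumes proportions below 1.0, with tenths as stems and hundredths as leaves.

```python
from collections import defaultdict

def crude_difficulty(responses):
    """Unweighted proportion of correct (1) responses among 0/1 scores."""
    return sum(responses) / len(responses)

def stem_and_leaf(proportions):
    """Return display lines with tenths as stems and hundredths as leaves,
    highest stem first."""
    stems = defaultdict(list)
    for p in proportions:
        hundredths = int(round(p * 100))
        stems[hundredths // 10].append(hundredths % 10)
    lines = []
    for stem in sorted(stems, reverse=True):
        leaves = "".join(str(d) for d in sorted(stems[stem]))
        lines.append(f".{stem} | {leaves}")
    return lines

# Hypothetical crude difficulties for one content category.
display = stem_and_leaf([0.86, 0.88, 0.91, 0.69, 0.72, 0.34])
print("\n".join(display))
```

Items whose leaves share a stem, such as the two items in the .8 row here, are the natural candidates for the difficulty-matched groupings described in the text.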


Under the assumptions of the latent class models, there is, of course,
no necessity that items requiring the same skills be of similar difficulty.
The item difficulty reflects not only the proportion of examinees possessing
requisite skills, but also each item's unique misclassification
probabilities, reflected in its item parameters (true positive rate and
false positive rate). Nonetheless, it is more likely that items similar in
difficulty will require the same skills than test items of widely differing
difficulties. Thus, the stem-and-leafs were useful in identifying initial
item clusters. Exercise parts within a category and with similar crude
difficulties were examined, and subsets of three were chosen that appeared
relatively homogeneous in content. In some cases, these were three parts
of a single exercise. In other cases, two or three distinct exercises were
represented. These triples, similar in content and difficulty, were the
units used to form the sets of six exercise parts analyzed at ages 9 and
13. (For age 17, a different procedure was followed.) Thus, each analysis
at ages 9 and 13 involved exactly two three-item clusters. A two-class
model was fit to the six exercise parts, and if satisfactory fit was not
obtained, additional states were introduced based on analysis of the
residuals. For all but 3 of the 47 sets analyzed at ages 9 and 13, the
additional skill distinctions required distinguished between the two
clusters but not within clusters. In other words, the three exercise parts
within each cluster required identical skills.

*If in future assessments exercises are packaged to provide topical coverage
(i.e., if different booklets focus on different themes), it would be
helpful for this kind of analysis if at least one or two booklets were
organized according to the present practice, with a few exercises of each
of many types.

Nine 3-item clusters were identified at each of ages 9 and 13. Thus,
a maximum of 36 runs at each age were possible taking all cluster pairs.
Resource limitations prevented carrying out all these analyses. (Recall
that one "analysis" could require as many as six or more runs.) Twenty-six
analyses at age 9 and twenty-one at age 13 were selected. In order to
locate all six-item sets for which a single dichotomous skill could be
assumed (sets for which the two-class model fit), or for which skills would
be hierarchical (one triple requiring a subset of those skills the other
required, as indicated by an acceptable fit for a three-class model),
adjacent clusters within a content strand were paired first. Additional
analyses were then included which involved clusters of similar difficulty
in different content areas.
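The count of 36 possible pairings follows from choosing 2 of 9 clusters. A brief sketch, using the Age 9 cluster labels that appear later in this report (the variable names are illustrative):

```python
from itertools import combinations
from math import comb

# Age 9 cluster labels as used in the results section of this report.
clusters = ["Add", "Sub", "Count", "Place", "SMD2", "Alg", "Geo", "Units", "Ruler"]

# Each pair of three-item clusters defines one six-item set for analysis.
pairs = list(combinations(clusters, 2))
print(len(pairs), comb(len(clusters), 2))
```

Both expressions give the same count, so the 26 analyses run at age 9 cover 26 of the 36 possible cluster pairs.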

After all analyses were performed, results were summarized in figures
and tables, and examined to determine which, if any, item clusters could
be collapsed and which could be organized hierarchically. This was
indicated by the latent classes required in each run. Next, for each
cluster, all analyses including that cluster were examined together. Each
of these analyses yielded a separate estimate of the proportion of examinees
able to solve items in that cluster. The distribution of these estimates
is one indicator of the cluster's conformity to the assumption of skill
dichotomies. The separate estimates of each item's true positive and false
positive rates were also tabled.

Many exercises in each chosen booklet were excluded from the analyses.
A basic framework was developed to encompass most exercises at each age
level, but available resources did not permit the additional analyses
necessary to include all the exercise parts in the booklet. Extension of
the basic structure is straightforward logically, but costly in human and
computer time, as additional content clusters must be paired with increasing
numbers of clusters already identified.

Using the NAEP Data Tapes

The tapes and documentation provided by the National Assessment were
outstanding. Compared to other large data sets I have worked with, there
were virtually no problems with either tapes or documentation. All but one
or two of the runs directly accessing the tapes ran correctly the first time
they were submitted.

IBM standard-label tapes were provided by ECS. Machine-readable
documentation files (fixed-length records with ASCII control characters)
were dumped using IOPROGM, a locally-written utility program. Data were
read using formatted READ statements in an original Fortran program. Record
layouts were easy to follow and appeared absolutely accurate. Weight
variables were well-documented as well. I found it especially useful to have
the unweighted frequency distributions of each variable in the Codebook
files. My use of these distributions was described earlier.

The few improvements I suggest below are minor; I was satisfied with
the tapes exactly as provided.

- It would be helpful to have appendices and codebook files on
  microfiche as well as on tape. As it is, it may be necessary
  to dump thousands of lines to answer a single, simple question.

- There is a good deal of redundancy in appendices for different
  age levels. If appendices were in separate files and if redundant
  files were identified in the printed documentation sent by ECS,
  some needless printing could be avoided.

- Some structure and consistency in the data is not documented in
  such a way that the user can take advantage of it. For example,
  the WEIGHTS variable is in the same columns in all booklets I
  examined at all age levels, but I did not find any summary
  indicating what portion of the data records is fixed across booklets
  and ages.

- For many purposes, it would be useful to be able to construct
  interpenetrating subsamples within each booklet, e.g., for jackknifing,
  crossvalidation, or replication of analyses to estimate standard
  errors. Construction of interpenetrating subsamples would be
  facilitated if a one-column "subsample ID" were included on
  each record, to be used in allocating records to separate subfiles.
  Within strata, PSUs would be randomly allocated to subsamples.
  Records within PSUs would not be divided, to avoid any possible
  error covariation between samples.
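The allocation suggested in the last item above could be carried out along the following lines. This is a sketch under stated assumptions: the record layout (a dictionary with a "psu" field), the function names, and the stratum and PSU identifiers are all hypothetical and do not describe the actual NAEP tape format.

```python
import random
from collections import defaultdict

def assign_subsamples(psus_by_stratum, n_subsamples, seed=0):
    """Randomly allocate whole PSUs to subsamples within each stratum.
    Returns a mapping from PSU identifier to subsample ID (0-based)."""
    rng = random.Random(seed)
    assignment = {}
    for stratum, psus in psus_by_stratum.items():
        shuffled = list(psus)
        rng.shuffle(shuffled)
        for i, psu in enumerate(shuffled):
            # Cycling through IDs keeps subsamples balanced within a stratum.
            assignment[psu] = i % n_subsamples
    return assignment

def split_records(records, assignment):
    """Place each record in the subfile of its PSU; PSUs are never divided."""
    subfiles = defaultdict(list)
    for rec in records:
        subfiles[assignment[rec["psu"]]].append(rec)
    return subfiles

# Hypothetical strata, PSUs, and records.
psus_by_stratum = {"S1": ["p1", "p2", "p3", "p4"], "S2": ["p5", "p6"]}
assignment = assign_subsamples(psus_by_stratum, n_subsamples=2, seed=1)
records = [{"psu": psu, "case": n} for psu in assignment for n in range(3)]
subfiles = split_records(records, assignment)
print({sub: len(recs) for sub, recs in sorted(subfiles.items())})
```

Because whole PSUs are assigned, every record from a given PSU lands in the same subfile, which is the property the recommendation asks for.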
Results and Interpretation: Age 9 


Figure 1 displays stem-and-leafs of crude item difficulties for all
exercise parts in Booklet 3, Age 9. Homogeneous sets of exercise parts have
been circled, and the clusters (triples) drawn from each set are indicated.
The three "Alg," or Algebra, items include two using a number line and one
asking which number sentence would be used to solve an equation. All three
were multiple-choice exercises. The "Add" and "Sub" clusters include simple
addition and subtraction problems presented orally. These were
free-response. "Count" included counting sets of small illustrations in
the exercise booklet, by ones, by twos, and by tens. The first of these,
counting squares by ones, was free-response. The latter two, counting
shoes in pairs and marbles in bags of ten, were multiple-choice. "Place"
items included giving the place values of specific digits in numbers, and
an exercise like writing "ten sevens".* All were multiple-choice. The more
difficult computation cluster, "SMD2," included two- or three-digit
subtraction, multiplication, and division problems. These were similar to*
47 [...], 605 - 328, and 56 ÷ 4, all free-response. "Geo" (geometry)
included multiple-choice items asking in which figure the halves would not
match if it were folded, which figure illustrated parallel lines, and which
illustrated three line segments that could not make a triangle. The
two-choice "Units" items asked which is more: a yard or two feet, two
quarts or three pints, two dimes or three nickels. Finally, the
free-response "Ruler" items involve using a ruler to measure a line,
measure the distance around a triangle, and draw a line of a specified
length. (Exercise numbers assigned to each of these items in Booklet 3 are
shown in Table 1, discussed below.)

*Actual content of secure exercises has been modified to avoid disclosure.

Figure 1



Age 9 Booklet 3 Stem-and-Leaf of Crude Item Difficulties
(Panels: Computation, Geometry, Measurement, Other)

The 26 analyses run on these clusters are diagrammed in Figure 2. Each
line connecting two clusters represents one analysis. The analysis number
(1-26) is written along the line, along with three proportions separated
by tick marks, which summarize the skill distinctions found in that
analysis. The center proportion indicates what fraction of examinees were
able to solve all exercises in both clusters. The proportions on the ends
closest to each cluster give the proportion able to solve items in that
cluster but not the other. If one of these is 0, a hierarchical
relationship is indicated: no examinees are able to solve those items who
cannot also solve the items in the other cluster. For example, Analysis
7, involving the Geo and SMD2 clusters, yielded the proportions .158/.444/0.
This indicates that 15.8% of the examinees could solve Geo but not SMD2
items, 44.4% could solve either type, but none were capable of solving the
SMD2 items and not the Geo items. Thus, Geo items fall below SMD2 items
in a hierarchy. Note that the remaining 39.8% of the examinees (100% - 15.8%
- 44.4%) could not solve any of the six exercise parts in this analysis.
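The reading of a proportion triple can be sketched as a small helper. The function name and the relation labels are illustrative assumptions; the example values are those quoted above for Analysis 7.

```python
def interpret_analysis(only_first, both, only_second):
    """Classify the relationship between two clusters from the three
    proportions on a line in the diagram, and return the remaining
    proportion able to solve neither cluster."""
    neither = 1.0 - (only_first + both + only_second)
    if only_first == 0 and only_second == 0:
        relation = "collapsed"            # one common skill
    elif only_second == 0:
        relation = "first below second"   # second never held without first
    elif only_first == 0:
        relation = "second below first"   # first never held without second
    else:
        relation = "distinct"             # each skill can be held alone
    return relation, neither

# Analysis 7: Geo only / both / SMD2 only.
relation, neither = interpret_analysis(0.158, 0.444, 0.0)
print(relation, round(neither, 3))
```

Here the first cluster is Geo and the second is SMD2, so the result corresponds to Geo falling below SMD2 in the hierarchy, with 39.8% able to solve neither.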

In some cases, two of the three estimated values were zero, as in
Analysis 11 involving the SMD2 and Alg clusters. In this case a single,
common skill distinction underlying exercises in both clusters proved
sufficient to explain the data. The two clusters collapsed into a single
set requiring the same, common skill.

For only two analyses at age 9, analyses 6 and 17, were items within
a cluster found to require distinct skills. In both cases, the Place
exercise requiring the examinee to select something like "ten sevens"
migrated to the other cluster. In analysis 6, this item proved to have more
in common with counting by ones, twos, and tens than with the other Place
items. In analysis 17, the same item migrated to the Geo cluster.




Figure 2 reveals a larger number of distinct skills than expected.
Only two pairs of clusters collapsed, leaving seven distinct skills.
Clusters collapsing were Units with Geo, and SMD2 with Alg. With only one
minor exception, the 26 runs indicated the same hierarchical pattern among
these seven skills. At the first level are the skills Add, Sub, Count,
Place, and Units-Geo. These may all be acquired independently, although
possession of Sub and not Add is quite rare (2.2% of the population), as
are possession of Count and not Add (3.8%), Place and not Add (3.4%), or
Units-Geo and not Add (3.5%). At the second level of the hierarchy are two
skills that may be acquired independently of one another, but are never
found in the absence of any of the first five skills. These are SMD2-Alg and
Ruler. The only minor exception is in analysis 22, which indicates 2.4%
able to solve Ruler items but not Add items. These hierarchical
dependencies may stem from two sources. First, SMD2 and Alg items may
require as component operations the skills entailed in Add, Sub, Place, etc.
Thus, the structure of the items gives rise to a logical dependency. Second,
elementary mathematics curricula may be so organized that the Alg-SMD2 and
Ruler skills are almost never introduced until after the skills at the first
level of the hierarchy. This would result in very few students being able
to solve the former and not the latter.

The next step in interpreting and integrating results of the analyses
was to table the item parameter estimates, and estimates of proportions able
to solve each item type. This tabulation is displayed in Table 1. The
leftmost columns of Table 1 give cluster and item identifications, followed
by item difficulties (p-values) calculated weighting each response according
to the WEIGHTS variable. The remaining columns give identifying information
and results for each analysis in which the cluster was included. The TP
and FP columns for each analysis present true positive and false positive
rate estimates for each item. To the left of these estimates is a column
of identifying information and results giving the analysis number and the
other set of items included in that analysis (line 1), the chi-square for
the final model (line 2), and the total proportion able to solve the
cluster, as estimated in that analysis (line 3).
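A weighted p-value of the kind tabled here can be computed as a weighted mean of 0/1 item scores. The sketch below uses a hypothetical function name and made-up scores and weights; it does not reproduce the actual WEIGHTS values.

```python
def weighted_p_value(scores, weights):
    """Weighted proportion correct: sum of weights for correct responses
    divided by the sum of all weights."""
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(scores, weights)) / total_weight

# Hypothetical item scores (1 = correct) and sampling weights.
print(weighted_p_value([1, 0, 1, 1], [2.0, 1.0, 0.5, 0.5]))
```

With equal weights this reduces to the unweighted "crude" difficulty used earlier to form the clusters.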

As can be seen in Table 1, estimates of the proportion able to solve
the cluster were extremely stable for the clusters Add, Sub, Place, and
Count. A wider range of estimates was obtained for Units, SMD2, Ruler, and
Alg, and estimates for Geo were fairly unstable. Consistency of estimates
appears to be related to the homogeneity of the exercises in the cluster.
For the clusters that collapsed, Units-Geo and SMD2-Alg, estimates of
the proportion possessing the common skill were consistent across the two
clusters. Averages were .42 and .41 for SMD2 and Alg, respectively, and
.69 and .77 for Geo and Units. Note that the discrepancy of .08 between Geo
and Units is not large given their standard errors (on the order of .025
for most estimates, yielding a t of 2.26) and is much smaller than the
difference between mean p-values for items in the two clusters (.63 for Geo
and .81 for Units).

Estimates of proportions able to solve each cluster are summarized in
Table 2. Note that these estimates are not statistically independent;
Table 2 is purely descriptive. The estimates for each cluster are not
representative of any examinee or item population, and reported standard



p 



Alg 



Table 1. Age 9 



TP 



ID2 06A 
08A 
21B 

Ruler 37A 
38A 
39A 



IBa 
31A 
3AA 







Sec/ 






Set/ 


Icem 


P 


Opp/P 


TP , 


FP 


Opp / P 


27B 


.91A 


01; Sub 


.98 


.38 


03:Place 


27D 


.863 


97.39 


.96 


.11 


60.46 


27F 


.880 


.89 


.97 


. 17 


.89 


16B 


.864 


01: Add 


.97 


.41 


02: Place 


16D 


.685 


97.39 


.CS 


.09 


93.08 


16F 


.721 


.76 


.92 


.10 


.76 


12A 


.796 


02: Sub 


.95 


.23 


03:Add 


123 


.859 


93.08 


.98 


.41 


60.46 


29A 


.791 


.78 


.89 ' 


.42 


.80 


13 A 


.925 


04: Add 


.97 


.67 


05 :Sub 


13B 




65. A9 


.90 


.22 


59.03 


13C 


.920 


.86 


.^0 


.35 


; .83 


05A 


.788 


1 13: Add 


.87 


.44 


1 lM:Sub 


05C 


.808 


} 38.97 


.92 


.35 


1 40.70 


05E 


.8-^6 


1 .31 


.89 


.66 


1 .84 


Un 


.771 


( 

} G7:SMD2 


.89 


.58 


08: Rule. 


28A 


.563 


1 52.01 


.71 


.35 


70.67 


30A 


.536 


i .60- 


.63 


.39 


.67 






1 s 









Estimates of Misclassification Probabilities by Item



. J40 
.278 
.187 

.619 
. 367. 
.659 

.324 
.'234 
.240 



07:Geo 

52.01 

.45 

08: Ceo 

70.67 

.49 

09:Ceo 

49.97 

.50 



.61 
.58 
.36 

.86 
.65 
.91 

.50 
.33 
.39 



.12 
.03 
.05 

.39 
.10 
.42 

.15 
. 14 
.09 



lO:Ruler 
69.22 
.36 

10:SMD2 
69.22 
.41 - 

11:SMD2 

47.30 

.30 



.97 
.91 
.90 

.87 
.89 



.70 
.62 

.68 
.63 
.41 

.86 
.73 
.94 

.48 
.31 
.51 



FP 
.37 
.09 
.16 

.41 
.07 
.12 

.20 
.36 
.41 



Set/ 
Opp/P 



.35 
.35 
.64- 

.56 
.30 
.37 



.45 
.12 
.47 

.25 
.20 
.12 



04 : Count 

65.49 

.89 



TP 

.98 
.96 
.i97 



05 :Count .97 

59.03 .88 

.76 .92 

06 :Count .97 

62.83 .99 

.78* .90 
(. 77 , . 77, .81) 



.70 I 06:Place 
.30 62.83 
.42 . .81 



.96 
.92 
.92 



FP 

.36 
.10 
.14 

.40 
.07 
.11 

.24 
.42 
.32 



.76 
.38 



15 : Count 
42. 98 
.81 • 

09:Alg 

49.97 

.50 



.15 ll:Alg 
.08 47.30 
.06 .30 



12:Al8 

45.57 

.52 



.90 
.89 

.90 
.73 
.65 

.69 
.69 

.45 

.85 
.62 
.90 



.12:Ruler .55 
45.57 .32 
.44 .A3 



.39 
.43 
.65 

.64 
.40 
.42 

. 19 
.10 
.07 

.36 
.09 
.39 

.14 
.17 
.09 



Set/ 

Opp/P 

13:Unics 

38.97 

.89 

14 rUnlts 

40.70 

.77 

17:Cco 

50.60 

.79* 

(. 79, . 79 , 

15:Units 

42.98 

.82 

:?4:Ruler 

49.49 

.76 

16:Add • 

49.54 

.84 

18 :Add 

69.21 

.54 

22: Add 

52.81 

.66 

20: Add 

36.78 

.38 



Sec/ 



TP 


FP 


Opp/P 


-TP 


FP 


.98 


.37 


l6:Ceo. 


.98 


.37 


.96 


.08 


49.54 


.96 


.08 


.97 


.16 


.89 


.97 


.16 


.97 


.39 


19:SMD2 


.97 


.43 


.87 


.07 


79.76 


.88 


.09 


.91 


.09 


.75 


.92 


.12 


.96 


.21 


21:Alg 


.96 


.24 


.99 


.40 


40.65 


.99 


.42 


.90 


.34 


.78 


.89 


.45 


80) 










.97 


.72 


23:r.uler 


.97 


. 74 


.91 


.32 


67.89 


.91 


.31 


.V2 


.38 


.81 


. 91 


4 0 


. 9 1 




25 :Alg 


.92 


.49 


.92 


.48 


38.79 


.94 


.52 


.90 


. 70 


. 69 


. 90 


7 2 


. 84 


.40 


17 :Place 


.85 


.46 


.64 


.22 


50.60 . 


.63 


.30 


.59 


.25 


.80 


.59' 


.32 


.59 


.05 


l9:Sub 


.60 


.10 


.51 


.00 


79.76 


.57 


.02 


.3? 


.04 


.46 


.34 


.05 


.78 


. 31 


23 :Counc 


.82 


.35 


.56 


.00 


' 67.89 


.60 


.06 


.54 


.30 


.57 


.88 


.36 


.50 


. 19 


21:Place 


.53 


.15 


.35 


.16 


; 40.65 


.31 


.16 


.4 7 


. 10 


i .46 


.43 


.07 



Set/ 
Opp/P 



18SHD2 

69.21 

.89 



H 
.98 
.96 
'.97 



FP 
.37 
.09 
.15 



I. 



Set/ 
Opp/P 



20:Alg 

36.78 

.89 



TP 

.98 
.96 
.97 



KP 

.35 
.08 
.13 



Sec/ 
Opp/P 

22: Ruler 
52.81 ' 
.90 



TP ' 

.'98 
.96 
.97 



26 :Ceo 


.90 


.51 


36.88 


.94 


.49 


.72 


.89 


.73 


26 :Uriits 


.86 


.54 


36.88 


.65 


.35 


.72 


.60 


.37 


24:Unics 


.85 ' 


.38 


49.49 • 


.64 


.08 


.51 


.90 


.41 


25 :Unlts 


.57 


. 16 


38.79 . 


.33 


.1.7 


.40 


.46 


.09 



FP \§ 

'II m 



.07 
.13 



ON 



I 

I 



1, 



*Average proportion for 3 items. In these runs, the three Place items
showed distinct skill profiles.



deviations should not be used to construct standard errors or confidence
intervals. For convenience observed item p-values are also reproduced in
Table 2. It may be noted that for multiple-choice exercise clusters Place,
Count, and Units, observed p-values were inflated due to guessing. For the
other two multiple-choice clusters, Geo and Alg, lower true positive rates
offset the higher guessing probabilities, and true proportions able to solve
these items exceeded the average p-values for the clusters.

In summary, the Age 9 analyses revealed seven distinct skills
underlying the 27 exercise parts examined, two of which were acquired only
after all of the remaining five. While the latter five skills could be
acquired in any order, only a few examinees lacking the Add skill possessed
any of the others. Cluster stability was related to the homogeneity of the
items included. Estimates of the proportion of nine-year-olds actually
able to solve each item type (actually possessing the various skills) ranged
from .41 to .89 (averages) or from .30 to .90 (estimates from individual
analyses). These are shown in Tables 1 and 2.



Table 2. Age 9: Summary Across Runs by Cluster

            Estimates of Proportion Able to Solve     Observed Item Difficulties
Cluster     N     Mean    S.D.    Min     Max           1       2       3
Add         8     .89     .004    .89     .90          .863    .880    .914
Sub         5     .76     .007    .75     .77          .685    .721    .844
Place       5     .79     .009    .78     .80          .791    .796    .859
Count       5     .83     .021    .81     .86          .802    .920    .925
Units       6     .77     .058    .69     .84          .788    .808    .846
Geo         6     .69     .127    .50     .84          .536    .568    .771
SMD2        5     .42     .093    .30     .54          .187    .278    .340
Ruler       6     .53     .084    .41     .66          .367    .619    .659
Alg         6     .41     .070    .30     .50          .234    .240    .324



Results and Interpretation: Age 13

Stem-and-leafs of crude item difficulties for Age 13, Booklet 6 are
displayed in Figure 3. As for Age 9, homogeneous sets are circled and
labeled to show the clusters drawn from each.

Figure 3. Age 13 Booklet 6 Stem-and-Leafs of Crude Item Difficulties
(Panels: Computation, Geometry, Measurement, Other)



In general, the Age 13 clusters are less homogeneous than Age 9 clusters.
For this reason, less descriptive labels are used. In the following
descriptions, the three exercise parts are described in the order of their
appearance in Age 13 Booklet 6. These descriptions can therefore be aligned
with the statistics tabled later in this section. Comp II included a
free-response item asking what fraction of some marbles is blue (answer
5/7), a multiple-choice item requesting the denominator of 3/5, and another
free-response item asking for the quotient in 9.5 divided by 3. The Comp
III and Comp IV clusters are the most homogeneous of any at Age 13. Comp
III includes three free-response items requesting the square roots of 9,
49, and 25. Comp IV exercises were multiple-choice, requesting decimals
equal to 1/4, 3/8, and 5/6. The "repetend" notation (e.g., .77 for 7/9)
appeared in the distractors for the second and third of these exercises,
and in the correct response for the third. Alg I included two number line
exercises, one free-response (Mark an X where 1.5 should be) and one
multiple-choice (What number is at point A), and a multiple-choice item
asking which number sentence could be used to solve a simple addition with
one omitted addend. The Alg II cluster included two free-response items
requiring solutions to simple equations and a multiple-choice "translation"
item asking which equation in X and Y expresses the idea that when two
numbers are added the order can be changed. Alg III included a
multiple-choice exercise requiring the inference that (-3768/n) = 314
implies n is negative, a free-response item requiring the inference that
f(n) = n + 5 implies f(3) = 8, and a graphing exercise: put an X at the
point (3,-4). The Geo II cluster consisted of three multiple-choice
exercises requesting the number of corners, faces, and edges a cube has,
given illustrations of each of these terms. In Geo IIa, the first exercise
was a composite of four exercise parts, pre-scored. Exercise parts 32A, B,
C, and D stated that two triangles were CONGRUENT, and then presented four
statements of attributes of congruent triangles as true-false questions
(equal sides, equal angles, equal areas, and superposability). These were
used to derive a single dichotomous variable, correct if all four statements
were answered "true", else incorrect. This was to provide a single, more
reliable measure of knowledge of the concept "congruent triangles." The
remaining exercises in Geo IIa were multiple-choice, requiring the examinee
to select line drawings representing a cylinder and a sphere. All three
concern knowledge of geometric terms. The last cluster, Meas, includes
multiple-choice questions on the number of quarts in a gallon, ounces in
a pound, and feet in a yard. (Exercise numbers assigned to each of these
items in Booklet 6 are shown in Table 3, discussed below.)
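The scoring rule for the composite congruence item can be sketched as follows. The part labels 32A through 32D come from the text; the function name and the dictionary representation of an examinee's answers are assumptions made for illustration.

```python
def score_composite(answers, parts=("32A", "32B", "32C", "32D")):
    """Return 1 if every listed part was answered "true", else 0.
    A missing part counts as incorrect."""
    return int(all(answers.get(part) == "true" for part in parts))

# Hypothetical examinees.
all_true = {"32A": "true", "32B": "true", "32C": "true", "32D": "true"}
one_false = {"32A": "true", "32B": "false", "32C": "true", "32D": "true"}
print(score_composite(all_true), score_composite(one_false))
```

Requiring all four parts makes a correct composite score much harder to reach by guessing than any single true-false part.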

The 21* analyses run on these clusters are diagrammed in Figure 4.
Interpretation of this figure is the same as for Figure 2, described in the
last section. Four of the Algebra and Computation clusters at Age 13
proved to measure a single skill. As shown by analyses 6, 14, and
16, the Comp II, Comp IV, Alg II and Alg III clusters collapsed, leaving
only six distinct skills. In spite of the greater heterogeneity of items
within clusters at Age 13, there was only one analysis in which items within
clusters were found to require distinct skills. This was analysis 1,
involving the Alg I and Comp II clusters. In the final model for the six
items in these two clusters, roughly 6 percent could solve only two of the
Comp II exercises and one of the Alg I exercises, 5 percent could solve only
the remaining two Alg I exercises, and 73 percent could solve all six
exercises, as shown in Figure 4. This breakdown of the clusters in analysis
1 as well as the collapsing of Comp II, Comp IV, Algebra II and Algebra
III indicate that at Age 13 the Algebra vs. Computation distinction could
not be sustained empirically.

*Age 13 analyses are numbered 1-7 and 9-22.




Figure 4. Age 13 Analyses and Proportions



Hierarchical relationships among Age 13 skills were as follows. The
skills required for the Geo II, Geo IIa, Comp III, and Meas clusters could
all be acquired independently of one another. Geo IIa was prerequisite to
Alg I, and Geo IIa, Alg I, and Meas were all prerequisite to the Comp
II-Comp IV-Alg II-Alg III skill. That is, no one could solve Comp II, etc.,
items who could not also solve Geo IIa, Alg I, and Meas items, and no one
could solve Alg I items who could not also solve items in Geo IIa. The only
exception to these hierarchical patterns was in analysis 1, which indicated
that 5 percent of the examinees could solve two of the three Alg I exercises
but not two of the three Comp II exercises. As was noted for Age 9,
hierarchical dependencies may arise from at least two sources: the logical
dependency of more advanced content upon prerequisite or component skills,
and the conventional structure of the curriculum, in which some skills are
almost universally introduced earlier than others. The Alg I exercises
involving number lines and a simple number sentence are probably subordinate
to Alg II, Alg III, etc. for both of these reasons. The prior status of
Meas (common unit conversions) and Geo IIa (definitions of "congruent,"
"sphere" and "cylinder") appears more related to curricular structure than
to content.

Tabulations of item parameter estimates and estimates of proportions
able to solve each item type appear in Table 3. It is formatted exactly
like Table 1, in the last section. The leftmost columns give the cluster
and item identifications (from Age 13 Booklet 6), and p-values. To the right
of each cluster appear three-column summaries from all the analyses in which
that cluster appeared. These summaries include the analysis number and the
name of the other cluster involved (first column, line 1), the chi square
(first column, line 2) and the estimated proportion able to solve the
cluster (first column, line 3), as well as estimated true positive and false
positive rates for each item in the cluster (columns 2 and 3). The



Table 3. Age 13 Estimates of Misclassification Probabilities by Item



Clu.ster 


ICfero ' 
^5A 
lOA 


L, 


Set/ 
Opp/P 


IL 

.86 
.86 


IL 


Cor / 
act/ 

Opp/P 


II 

• 


TP 
:86 
. 85 


KP 


SetV 
Opp/P 





Comp 1 I 


.715 
.769 


01:Alg I 0 
59.38- 


.34 
.46 


02«:'teo 
36.23-^ 


.2L 
.49 


12:Meas 

o2 . / / 

. 70 




JBA * 






T8t- 


-72-5- 


-r/8 




.Hi 


. j4 










(!73.^8..78) 


















Comp lit 


■31A 


.6A9 


04:Alg II 


.96 


..05 


05:Meas 




.96 


.05 


15 : Comp 


1 1 


31B 


.601 


78.73 


.91 


.01 


59.70 




.91 


. 01 


73.10 






^IC 


.654 


.66 


.98 


.04 


.66 




. 98 


.04 


.65 ^ . 




,Comp IV 


,36A 


.378 


06;A.ig III 


.90 


.16 


18:Comp 


III 


.90 


.16 








36B 


.259" 


55.86 


.79 


.04 


87.91 




.78 


. 04 








36C 


.256 


.29 


.83 


.02 


.30 




.82 


. 02 








17B 


.779 


01: Comp- 11 


.92 


.29 


03:Cco 


II 


.93 


.37 


'13:Alg 


LI 


17C 


.801 


59.38 


.91 


.42 ■ 


46.-92 




,92 


.47 


76.13 






35A 


.733 ■ 


t!?7..77..78) 


.83 


.37 


.73 




.83 


.46 


.70 




^\A 11 


03A 


.556 


' 04: Comp III 


.78 


.18 


10: Geo 


II 


.76 


• .18 


ll:Meas 






,532 


78.73 


.72 


.22 


32.92 • 




.70 


.21 


57 . 45 






09A 


.639 


.63 


.84 


.30 


.66 




,.8,3 


.27 


. 59 




/U8 ill 


15A 


.323 


06: Comp IV 


.42 


.28 


14:Alg 


II 


.35 


.28 






23A 


.282 


55.86 


.39 


.24 


58.83 




.36 


.18 








30B 


.391 


.29 


.54 


.33 


.58 




.55 


.17 






Uo II- 


37A 


.843 


02: Comp II . 


.95 


.55 


03:Alg 


I 


.95 


. .56 


07:Meas 






378 


. 750. 


56.23 


.93 


:'24 


46.92 • 




.94 


.28 


52.84 






37C 


.727 


.74 


.91 


.20 


.71 




.92 


.24 


.69 . 




Ceo [ I A 


32ABCD 


.345 


V • 
' 09iGeo 1 1 


.48 


.11 


i9:Alg 


II 


.50 


.13 


20: Comp 


TIT 




33A 


.782 


61.00 


.96 


.48 


52.66 




.94 


.56 


68.03 






3jq 


.639 


.64 


.82 


.32 


.58 




*.87 


.32 


.62 






16B 


.633 


i 05: Comp III 


.84 


.40 


07; Geo 


II 


.84 


.40 


ll.:Alg 


IT 




16C 


.622 


59.70 




.30 


52.84 




.86 


.35 


157.45 






16D 


.652 


1 -53 


.8/ 


.40 


.53 




.91 


.36 


i .47 





Stz/ 

TP Vp Qpp/P 

35 
53 



q87 
.87 



.96 
.91 
.98 



.93 
.92 
.85 

.80 ■ 

.73 

.84 



.96 
.95 
.94 

.49 
.95 

.84 

.85 
.93 
.90 



15:Comp' m 
73.1 0 . 

T7B : 



TP 

.82 
.85 



Sec/ 
FP Qpp/P 

.32 



•TP 



■ Set/ 



Sec/ 



.46 

-72t 



.05 
.01 
.04 



.42 
.52 
.46 

.21 
.24 
.35 



.59 
.32 
.26 

. 10 
.51 
.30 

.43 
.35 
.43 



17:Alg I 

68.62 

.66 



17:Comp III 

68.62 

.76 

niAly I ^ 

76.13 

.63 



61.00 
.70 



! 21:Alg 
; b6.55 
1 .66 

} 

j 12: Comp 1 1 
{62.77 
I .58 



.96 
.91 
.98 



.92 
.93 
.82 



.77 
.70 
.85 



.05 
.01 
.03 



.35 
.40 
.47 

.19 
.25 
.29 



l6:Alg II 
67.40 

T58 



737 
.53 



18: Comp IV 

87.91 

.65 



.96 
.91 
.98 



.05 
.01 
.03 



.82 
.88 
.85 



.38 
.27 
.38 



21:Geo Ila .94 
66.55 .92 
.73 .83 

14:Alg III .79 
58.83 .73 
.58 .88 



.36 
.49 
.48 

.24 
.27 
.31 



rCeo lla 
.80 



■Tt 

20:.Ceo II« 
38.03 
65 



.6: Comp II 
i7.40 
68 



lla 


.96 


.58 


10:Alg 11 


.95 


.56 




.95 


.29 


32.92 


.94 


.25 . 




.93 


.27 


.72 


.91 


. 24 


I 


.47 


.10 


22: Comp II 


.47 


.09 * 




.93 


.50*^ 


65.80 


.93 


.49 




.83 


.27 


.67 


.83 


.26 



IL 


IL 


Opp/P 


■ .85 


.31 




^ .86 


.49 




—89- 






.96 


.05 




.91, 


.01 




.98 


.03 




























.76* 


. 13 


19rGeo iTa 


.68 . 


. 22 




.81 * 


. 27 

















TP FP 



.75 .17 
.69 •.i2 
.83 .27 



'.Average proportion for 3 Icoms. In this ru... che Alg 1 and Co.p II Ue,ns showed distinct skill profiles uithln sets. 






stability of the true positive and false positive estimates is closely
related to the stability of the estimated proportion able to solve the
cluster. These latter estimates for each cluster are summarized in Table
4, which also repeats the observed p-values for each item. For all the Age
13 clusters except Alg I, estimated proportions able to solve were as high
or higher than average item difficulties, indicating that random guessing
is relatively rare among 13-year-olds. (Guessing on multiple-choice items
increases p-values, so that average item difficulties exceed the proportion
possessing the skill.) Lower true positive rates for exercises at this age
level indicate that more of the individual exercises examined present unique
difficulties, or require specific information not shared with other items
in the cluster. Thus, examinees possessing the skills these items require
in common may nonetheless err on one or another of the separate exercises.
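
The link between an item's observed p-value and the latent class estimates
can be checked numerically. The sketch below assumes the two-class form of
the model, in which the expected proportion correct on an item is the
mixture (proportion able) x TP + (1 - proportion able) x FP. The function
name is illustrative only; the figures are the Alg III entries for analysis
6 as read from Table 3.

```python
def expected_p_value(prop_able, true_pos, false_pos):
    """Expected proportion correct under a two-class model (illustrative helper)."""
    return prop_able * true_pos + (1.0 - prop_able) * false_pos


# Alg III items in analysis 6 (with Comp IV), values as read from Table 3.
prop_able = 0.29
items = {
    "15A": (0.42, 0.28, 0.323),  # (TP, FP, observed p-value)
    "23A": (0.39, 0.24, 0.282),
    "30B": (0.54, 0.33, 0.391),
}

for name, (tp, fp, observed) in items.items():
    predicted = expected_p_value(prop_able, tp, fp)
    print(f"{name}: predicted {predicted:.3f}, observed {observed:.3f}")
```

Because the false positive rates are well above zero, the predicted
p-values sit above what the true positive rates alone would give, which is
the guessing effect described above.
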

Table 4 shows that estimated proportions able to solve are not in close
agreement for exercises in the four clusters that collapsed to a single
skill. They range from .30 (Comp IV) through .44 (Alg III) and .63 (Alg
II) to .74 (Comp II). The structure of the set of Age 13 analyses implies
that for at least one of these clusters, estimates from different analyses
must have been quite variable, and, in fact, the problem appears to be with
the Alg III cluster alone (see standard deviations in Table 4). It can be
seen in Figure 4 that Alg III was used in only two analyses, 14 (with Alg
II) and 6 (with Comp IV). In each of these analyses, a two-class model gave
an acceptable fit, indicating a single common skill for all three clusters.
However, the proportion able to solve Alg II and Alg III was estimated as
.578, while the proportion able to solve Comp IV and Alg III was estimated
as only .294. Additional analyses would be required to probe the reasons
for this anomaly. In particular, it would be useful to analyze Alg II and
Comp IV together. Runs for analysis sets 6 and 14 in which the proportion
possessing the common skill was constrained to equal a constant would also
be informative. Work with other item sets




Table 4

Age 13: Summary Across Runs by Cluster

            Estimates of Proportion Able to Solve     Observed
Cluster     N      Mean    S.D.    Min     Max         Item Difficulties

Comp II     6      .74     [?]     [?]     [?]         .715   .751   .769
Comp III    6      .66     [?]     .65     .66         .601   .649   .654
Comp IV     2      .30     .007    .29     [?]         .256   .259   .378
Alg I       5      .74     .028    .70     .77         .733   .779   .801
Alg II      7      .63     .037    .58     .68         .532   .556   .639
Alg III     2      .44     .205    .29     .58         .282   .323   .391
Geo II      5      .92     .013    .91     .94         .727   .750   .843
Geo IIa     5      .84     .019    .82     .87         .345   .639   .782
Meas        [?]    .88     .028    .85     .91         .622   .633   .652

[?] = not legible in the source copy.




indicates that the two-class model may be relatively insensitive to such
constraints, i.e., that the chi square may increase only slightly if
different latent class proportions are fixed in advance. (This has not been
found for more complex models.) Another possible explanation is found in
the decision rules for the series of runs for each analysis. In analyses 6
and 14, two-class chi squares of 56 and 59, respectively, were obtained (see
Table 3). Thus, no additional runs were performed on these item sets. The
introduction of a third latent class might have resulted in substantial
reduction of (already nonsignificant) chi square values, leading to new
"final" models in which Alg II, Alg III, and Comp IV were hierarchically
related. This would resolve the discrepancies among Alg II, Alg III, Comp
II, and Comp IV. Further work on the decision rules is called for.
Hierarchical organization among Alg II, Alg III, and Comp IV is plausible,
and would extend the hierarchical relationship observed between Alg I and
Alg II. This interpretation, that the marginally satisfactory 2-class fits
of analyses 6 and 14 should not have been accepted, is also supported by
the nearly hierarchical organization of the Comp II, Comp III and Comp IV
clusters shown in analyses 15 and 18. The fact that only 6 percent of the
examinees could solve Comp IV and not Comp III and only 7 percent could
solve Comp III and not Comp II, while percentages in the other direction
are 41 percent and 20 percent, makes it seem unlikely that Comp II and
Comp IV require the same skills.

In summary, the Age 13 analyses showed that four of the nine clusters,
Comp II, Comp IV, Alg II, and Alg III, required a single skill. This finding
is suspect, however, on two grounds. First, a nearly hierarchical
relationship was found among Comp II, Comp III, and Comp IV. Second, the
Alg III cluster was included in only two analyses, which yielded quite
different estimates of the proportion able to solve Alg III items. These
were two of the three runs indicating that Alg II, Comp II, Alg III, and
Comp IV required a common skill. Reconsideration of analyses 6 and 14 would
require changing the a priori decision rules which governed all the Age
9 and Age 13 analyses reported. However, further investigation of these
findings is called for. Assuming that further runs would indicate
a hierarchical relationship among Alg II, Alg III, and Comp IV, the
equivalence established between Alg II and Comp II would still stand. Thus,
no more than 8 skills would be required to explain item response patterns
in these data, with the great majority of 13-year-olds mastering first Meas
and Geo IIa, then Alg I, and then progressing to Alg II-Comp II, then on
to Alg III and Comp III, followed by Comp IV. Geo II does not appear in
the hierarchy but is roughly parallel to Geo IIa in difficulty and time of
acquisition. Estimates of the proportion of examinees possessing these
skills ranged from .36 (Comp IV) to .92 (Geo). These are shown in Table 4.

Results and Interpretation: Age 17 

At Age 17, a different analytic strategy was followed. Rather than
mapping relationships among homogeneous item triples in a single booklet,
sets of six exercises were drawn from each of six separate booklets. All
these 36 items focussed on a common, broadly conceived skill, and no two
were parts of the same exercise. Rigid rules were not followed in carrying
out these analyses. Rather, each series of runs was pursued until all
additional latent classes suggested by examination of residuals had been
tried. Of course, no latent classes were retained unless their introduction
resulted in a statistically significant improvement in the fit of the model.

There were several reasons for adopting a different strategy at Age
17. First, patterns of findings at Age 13 suggested that it might be
premature to stop with the first run yielding a nonsignificant chi square.






Second, the large number of distinct skills found at both Age 9 and Age 13
suggested that the skills obtained might be of limited generalizability.
It had been anticipated that fewer, broader skills would be identified.
In order to identify broader skills, the six-item sets at Age 17 were each
formed from six exercise parts that appeared to require the same common
concept, that letters can replace numbers in statements of equations and
inequalities. (At Ages 9 and 13, three items for each of two concepts had
been used in each set.) To minimize common method variation that might give
spurious skill distinctions, only the "A" parts of multi-part exercises were
included. Replication across six separate examinee samples (assured by
sampling of exercises from six different booklets) also promised to increase
generalizability, and permitted significance testing to compare
estimates of latent class proportions from different exercise sets, since
these estimates could be assumed statistically independent.

To identify the exercise sets examined at Age 17, the Age 17 appendices
were examined and all exercises were noted that appeared to require an
understanding that in that exercise, letters took the place of numbers.
The 88 exercise parts identified (from 74 distinct exercises) included some
classified as algebraic manipulation (solving equations, simplifying and
factoring, plotting graphs), mathematical skills and computation, numbers
and numeration, understanding/translation, and other topics. Actual
exercise parts were then examined on microfiche, and roughly 20 were
rejected as not relying on the common skill. This left just six of the
twelve Age 17 booklets with at least six acceptable exercises. Six
exercises were drawn from each of these booklets, as shown in Table 5. The
smaller number of six-item sets analyzed at Age 17 does not imply a lower
commitment of either time or computer resources. More distinct exercise
parts are involved at Age 17 than at either Age 9 or Age 13; it was
necessary to access more tape files, because multiple booklets were involved;



Table 5

Exercises Examined in Age 17 Analyses, p-values, and Latent Classes Required

                                Latent Classes(a)
Set  Booklet  Exercise  p       NULL   REAS   INT    ALL

1    01       05A       .233    0                    1
              09A       .568    0                    1
              22A       .041    0                    1
              24A       .706    0                    1
              33A       .662    0                    1
              39A       .796    0                    1
     Chi square (50 df) = 59.29. Proportions = .574, .426

2    02       02A       .725    0      0      1      1
              04A       .048    0      0      0      1
              09A       .449    0      0      0      1
              18A       .702    0      1      1      1
              27A       .883    0      1      1      1
              30A       .473    0      1      0      1
     Chi square = 64.33. Proportions = .394, .066, .083, .457

3    03       03A       .521    0      1      1      1
              14A       .174    0      0      0      1
              18A       .788    0      1      1      1
              23A       .581    0      1      1      1
              35A       .187    0      0      0      1
              38A       .262    0      0      1      1
     Chi square (48 df) = 52.37. Proportions = .582, .104, .057, .256

4    04       05A       .124    0      0      0      1
              13A       .580    0      [?]    1      1
              20A       .490    0      1      0      1
              26A       .686    0      1      1      1
              29A       .075    0      0      0      1
              35A       .578    0      1      1      1
     Chi square (48 df) = 63.08. Proportions = .553, .112, .152, .184

5    07       07A       .343    0      0             1
              17A       .370    0      0             1
              24A       .531    0      0             1
              28A       .526    0      1             1
              33A       .470    0      1             1
              38A       .193    0      0             1
     Chi square (49 df) = 62.08. Proportions = .565, .077, .358

6    09       05A       .178    0      0             1
              06A       .382    0      0             1
              19A       .368    0      1             1
              20A       .392    0      0             1
              21A       .561    0      1             1
              33A       .629    0      1             1
     Chi square (49 df) = 93.23. Proportions = .520, .061, .419

a "1" indicates that examinees in that latent class could solve the corresponding
item; "0" indicates that they could not.
[?] = not legible in the source copy.



and most important, the absence of distinct, homogeneous clusters made
actual fitting more difficult.

Results of the six analyses will be discussed in order of increasing
complexity. Set 1 yielded a satisfactory fit to the simplest, two-class
model. The chi square and estimated latent class proportions for this and
other analyses are displayed in Table 5. It should be noted that exercise
p-values for set 1 ranged from .04 to .80, which makes it appear unlikely
that a single skill distinction could explain examinee performance across
the entire set. In fact, the true positive rate for exercise 22A (with a
p-value of .04) was estimated as only .10. It may be more reasonable to
think of this item as requiring some additional skill not required by the
remaining five items in the set, rather than supposing that it requires the
same skill but that those possessing the requisite skill had only one chance
in ten of answering it correctly. These two interpretations are, in fact,
interchangeable. They are statistically equivalent, yielding identical
predicted response pattern frequencies, residuals, and chi squares. The
more reasonable model, in which item 22A alone requires some additional
skill, would include three latent classes, for examinees who could solve
none of the exercises, all but 22A, and all of the exercises. This model,
however, is said to be not identified. Its parameters cannot be estimated
because they cannot be uniquely determined. In particular, there is no way,
with this model, to distinguish a student in the second class who gives a
false positive response to item 22A from a student in the third class who
gives a true positive. By making compensating adjustments in the latent
class proportions and item 22A's misclassification probabilities, identical
predicted response pattern proportions can be obtained. The two-class model
actually fitted to set 1 represents one possible set of parameter estimates
for the nonidentified three-class model. Another set could be obtained by
fixing the true positive rate for item 22A at 1.00, in which case latent
class proportions would be .574 (null), .385 (all but 22A), and .041 (all
items). Fixing the second latent class proportion at any value between 0
and .385 or fixing the true positive rate for 22A at any value between .096
and 1 would yield an equivalent model. In conclusion, the analysis for set
1 might be interpreted as showing a single skill distinction with an
extremely low true positive rate for one item, or as showing that all items
except 22A require the same skill and the latter requires a unique
additional skill.
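
The equivalence of the two parameterizations can be illustrated with a small
calculation. The sketch below is not the fitting program used in the study.
It assumes conditional independence within latent classes, uses only three
items instead of six for brevity, and takes made-up misclassification rates
for the two ordinary items; the false positive rate of zero for item 22A is
also an assumption. Only the class proportions (.574, .426) and the true
positive rate of .096 come from the text.

```python
from itertools import product


def pattern_probs(class_props, item_probs):
    """Response pattern probabilities under conditional independence.

    class_props: latent class proportions.
    item_probs:  item_probs[c][i] = P(item i correct | class c).
    """
    n_items = len(item_probs[0])
    probs = {}
    for pattern in product((0, 1), repeat=n_items):
        total = 0.0
        for prop, p_correct in zip(class_props, item_probs):
            like = prop
            for x, p in zip(pattern, p_correct):
                like *= p if x else 1.0 - p
            total += like
        probs[pattern] = total
    return probs


# Hypothetical rates for two ordinary items; the last item stands in for 22A.
tp_other, fp_other = [0.90, 0.85], [0.10, 0.15]
fp_22a = 0.0    # assumed: no false positives on 22A
tp_22a = 0.096  # two-class true positive rate implied by the text

# Two-class model: NULL (.574) and ALL (.426).
two_class = pattern_probs(
    [0.574, 0.426],
    [fp_other + [fp_22a], tp_other + [tp_22a]],
)

# Three-class model with the 22A true positive rate fixed at 1.00.
able = 0.426
all_items = able * (tp_22a - fp_22a) / (1.0 - fp_22a)
all_but_22a = able - all_items
three_class = pattern_probs(
    [0.574, all_but_22a, all_items],
    [fp_other + [fp_22a], tp_other + [fp_22a], tp_other + [1.0]],
)

max_gap = max(abs(two_class[k] - three_class[k]) for k in two_class)
print(round(all_but_22a, 3), round(all_items, 3), max_gap)
```

Under these assumptions the three-class proportions round to .385 and .041,
and the largest difference between the two sets of predicted pattern
probabilities is at the level of floating-point error, so no chi square
could separate the two models.
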

The next two sets, in order of complexity, were sets 5 and 6, for which
the final models included three latent classes. For set 6, the final model
fit relatively poorly, as indicated by a chi square of 93.23 on 50 degrees
of freedom. No additional classes could be found which reduced this value.
As shown in Table 5, the skill distinctions found in sets 5 and 6 would not
be predicted on the basis of item p-values. While any interpretation of
these skill distinctions would require cross-validation, it appears in each
case that examinees in the intermediate state can solve just those items
that could be answered correctly by reasoning, without formal training in
algebra. These are non-routine problems, multi-step word problems, and
items testing number and numeration concepts in a form different than that
typically encountered in instruction, and may be more related to general
intelligence and less to schooling than the other items in sets 5 and 6.
No explanation is offered for the relatively poor fit of set 6.

The remaining sets all required four latent classes for an adequate
fit. Sets 3 and 4 are treated first, because the latent classes in these
runs conformed to a hierarchical skill pattern. As shown in Table 5, for
set 3 examinees could solve all of the items, all but 14A and 35A, all but
14A, 35A and 38A, or none of the items. A similar pattern emerged for set
4. As for sets 5 and 6, skill distinctions appear to reflect the
solubility of items by reasoning as opposed to formal training. Examinees
in intermediate states can solve items concerning abstract properties of
the number system, especially multiple-choice exercises, but fail to solve
routine computational exercises that appear to depend upon specific, formal
training in algebra. Unlike sets 5 and 6, the patterns of intermediate
states in these two sets correspond precisely to the patterns indicated by
item difficulties. In set 3, for example, 10.4% can solve only items with
p-values above .5, an additional 5.7% can solve all items with p-values
above .2, and 25.6% can solve all items.

Set 2 yielded the most complex pattern of latent classes. Since the
intermediate states are not nested hierarchically, data from set 2 are not
compatible with the assumption of an ordered, unidimensional continuum of
content knowledge or content acquisition. It may be no accident that set
2 is also the easiest set of items. In other work with reading
comprehension item response data, the author has found similar anomalies
with very easy items. It is tempting to assume that these items are
amenable to solution by several strategies. Some students have a strategy
that works for items 02A, 18A, and 27A, while others have a strategy for
18A, 27A, and 30A. Unfortunately, it is not clear what these strategies
would be, and it appears unlikely that many researchers would predict in
advance that these two intermediate classes would be the ones to emerge.
The pattern of latent classes here is not consistent with patterns of item
difficulty.
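
The difference between the nested classes of set 3 and the non-nested
classes of set 2 can be stated as a simple check: the classes form a
hierarchy only if they can be ordered so that each class solves every item
solved by the class before it. The sketch below is one way to express that
check; the function names are illustrative, and the 0/1 patterns are those
read from Table 5.

```python
def solvable_sets(patterns):
    """Turn {item: (class indicators...)} into one set of solvable items per class."""
    n_classes = len(next(iter(patterns.values())))
    return [
        {item for item, flags in patterns.items() if flags[c] == 1}
        for c in range(n_classes)
    ]


def is_guttman(patterns):
    """True if the classes can be ordered so each solves a superset of the previous."""
    ordered = sorted(solvable_sets(patterns), key=len)
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


# Latent class patterns (NULL, REAS, INT, ALL) as read from Table 5.
set3 = {
    "03A": (0, 1, 1, 1), "14A": (0, 0, 0, 1), "18A": (0, 1, 1, 1),
    "23A": (0, 1, 1, 1), "35A": (0, 0, 0, 1), "38A": (0, 0, 1, 1),
}
set2 = {
    "02A": (0, 0, 1, 1), "04A": (0, 0, 0, 1), "09A": (0, 0, 0, 1),
    "18A": (0, 1, 1, 1), "27A": (0, 1, 1, 1), "30A": (0, 1, 0, 1),
}

print(is_guttman(set3))  # True
print(is_guttman(set2))  # False
```

Set 2 fails because its two intermediate classes each solve three items but
not the same three, so neither is contained in the other.
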

Despite the variety of models fitted to these six item sets, some
important consistencies do emerge. Note first the estimated proportions
in the NULL classes for the six runs, those unable to solve any of the
items. With the exception of the anomalous value for set 2 (the last
set just discussed), these estimates fall in the narrow range from .520 to
.582. Even the conservative procedure of testing the difference between the
smallest of these and the largest yields a non-significant t of 1.86.
This indicates that the NULL proportion was consistent across sets 1 and
3-6, within sampling error. As discussed above, for the sets in which
intermediate classes emerged, one category typically appeared to be of
students who could solve non-routine or multiple-choice exercises but not
conventional algebra exercises. For each of sets 2-6, the intermediate
class that appeared to best represent this pattern was identified. In Table
5, patterns for this class appear in the column headed "REAS". Estimated
proportions for this class ranged from .061 to .112. As for the NULL
proportion estimates, the conservative t-test of the smallest against the
largest of these estimates was calculated. A non-significant value of 1.27
was obtained, indicating that the assumption of a common skill pattern,
detected across analyses, could not be rejected. Roughly 8 percent of
examinees could correctly answer items soluble by reasoning, but not those
requiring formal training in algebra. Further investigation of the
course-taking patterns of these students is clearly warranted.
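
The form of the comparison between the smallest and largest estimates can
be sketched as a test of the difference between two independent
proportions. The code below is a simplified stand-in, not the procedure of
the study: it uses a large-sample z statistic, ignores the design effect
adjustment, and relies on a hypothetical effective sample size (the report
does not give one in this section), so it should not be expected to
reproduce the reported t of 1.86 exactly.

```python
from math import sqrt


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    se = sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    return (p2 - p1) / se


# Smallest and largest NULL proportions (sets 6 and 3).
# EFFECTIVE_N is a made-up effective sample size, not a figure from the report.
EFFECTIVE_N = 450
z = two_proportion_z(0.520, EFFECTIVE_N, 0.582, EFFECTIVE_N)
print(f"z = {z:.2f}, significant at .05 (two-tailed): {abs(z) > 1.96}")
```

The independence of the two estimates, which the formula requires, follows
from the use of separate booklets and therefore separate examinee samples.
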

The final commonality observed across these analyses concerns the
unidimensionality of the skill continuum. In all but set 2, latent classes
showed a Guttman scale pattern. As noted earlier, the anomaly in set 2 was
for items markedly easier than those in the other five sets. Thus, it may
be concluded that a single continuum of content acquisition underlies all
moderately difficult to difficult exercises involving the use of letters
to represent real numbers. As with the hierarchical relationships detected
at ages 9 and 13, this continuum probably results from both the logical
structure of the exercise content and the conventional structure of the
mathematics curriculum in American schools.






Conclusions

Developing appropriate, sensitive methods to analyze the National
Assessment data is a difficult challenge. If analysis and reporting are
to go beyond the level of individual items, new methods must be developed
or existing methods adapted to a matrix-sampled data base in which exercises
are conceived not as measures of a few common, underlying traits, but as
representing many distinct classes of specific performances, each of
interest in its own right. In other words, individual NAEP exercises
represent more or less distinct content domains, each of which was specified
because it was of some intrinsic interest. Classical test theory, with its
focus on the detection of stable individual-difference variation against
a background of measurement error, is ill-suited to the analysis of content-
or domain-referenced instruments. Newer item response theory (IRT, or latent
trait) models are as yet only practical when applied to large numbers of
items that may be assumed unidimensional. The large number of distinct,
independent skills detected in this investigation strongly suggests that IRT
models will be of limited value.

Both classical and IRT models begin with the simplifying assumption
that items measure a common underlying skill, an assumption of
unidimensionality. In this study, latent class models rather than latent
trait models were used, and a different simplifying assumption was involved:
that examinees may be classified as to whether they can or cannot solve
each individual item, and that these two possible classifications correspond
to just two distinct probabilities of a correct response to the item. Other
assumptions, especially conditional independence, are the same for latent
class and IRT models. The latent class models give considerable flexibility
in accounting for the specific characteristics of individual items, and




also permit the representation of multidimensional structures of skill
possession, or non-linear patterns of skill acquisition.

This study has demonstrated the utility of latent class models in
describing NAEP data, but has also highlighted technical problems in need
of further investigation. Significance testing when data are from a
stratified cluster sample is an unsolved problem. The design effect
adjustment used in this study is clearly not the best possible. The only
estimates presently available are maximum likelihood, and while some
large-sample properties of these estimates are known, further work on the
problem of estimation is called for. Even more important, work is
needed on the problem of model selection. The methods developed in this
study for the analysis of residuals are of considerable help in finding the
latent classes required to represent a given set of response patterns.
However, these methods cannot as yet be automated, and are far from
infallible. Beyond the problem of modeling individual exercise sets is the
broader problem of selecting exercise sets for analysis. The strategy
followed at ages 9 and 13 probably yielded clusters that were too tight,
sharing more common method variance than desirable. The strategy used at
age 17, on the other hand, yielded broader exercise clusters and facilitated
replication across booklets, but did not yield as comprehensive a map of
exercise skill requirements. Extension of computational methods to
accommodate larger numbers of items would help to solve this problem.

Despite present limitations of the methodology, some important
generalizations do emerge from the analyses reported. At age 9, roughly
89 percent of the children could do simple addition problems correctly.
Subtraction, counting, place value problems, unit conversions, and simple
geometric terms and concepts were all relatively independent of one another,
i.e., required, for the most part, distinct skills. These all tended to be
acquired after the addition skill. Roughly 77 percent of nine-year-olds






could solve problems of these kinds. More difficult were two-digit
subtraction, multiplication, and division problems, number line and
number-sentence problems, all available to roughly [illegible] percent of
the population, and hierarchically dependent upon the easier skills. The
skills required to use a ruler were only slightly easier (53 percent) and
also tended to be acquired after the less difficult skills.

At Age 13, several levels of Algebra/Computation skills were detected,
but were not clearly distinguished. The skills required to work with number
lines and number sentences, to express part of a small set as a common
fraction, and to do a simple long division problem were available to 74
percent of the sample. Sixty-six percent could interpret the radical sign,
but only 30 percent could give the decimal equivalents of common fractions.
These skills were not strictly hierarchical, but only a few percent of the
examinees possessed the more difficult skills while lacking the easier ones.
The skills required for a variety of algebra exercises (including topics
in numbers and numeration) were available to 40 to 60 percent of the
examinees. As with computation, these skills were nearly hierarchical.
Relatively independent skills were required for simple unit conversions
(possessed by 88 percent of the examinees) and geometry facts and concepts
(84 and 92 percent, respectively).



The Age 17 analyses were structured differently, and did not yield a
comprehensive map of skill possession. Only algebra exercises were
examined, but these were drawn from six different booklets. It was found
that 52 percent of 17-year-olds were unable to solve any items in which
letters represented variable numerical quantities. Another 8 percent could
solve problems of this kind that did not appear to depend heavily upon formal
training in algebra, but were unable to solve more routine algebra problems.
For all but the easiest exercises, data were compatible with the assumption
of a linear skill hierarchy.

The Age 17 strategy for planning exercise sets could be fruitfully




applied to other broad topics. Further work with these models and methods
should increase our understanding of the academic capabilities of American
youth.






References

Bock, R.D. & Lieberman, M. Fitting a response model for n dichotomously
    scored items. Psychometrika, 1970, 35, 179-197.
Dayton, C.M. & Macready, G.B. A probabilistic model for validation of
    behavioral hierarchies. Psychometrika, 1976, 41, 189-204.
Goodman, L.A. A new model for scaling response patterns: An application
    of the quasi-independence concept. Journal of American Statistical
    Association, 1975, 70(352), 755-768.
Haertel, E.H. Determining what is measured by multiple choice tests of
    reading comprehension. Unpublished doctoral dissertation, University
    of Chicago, 1980.
Lazarsfeld, P.F. & Henry, N.W. Latent structure analysis. New York:
    Houghton Mifflin, 1968.
Proctor, C.H. A probabilistic formulation and statistical analysis for
    Guttman scaling. Psychometrika, 1970, 35, 73-78.




